Analysis 1

Author

Sarah Rosenberg Asmussen (s194689), Mette Bøge Pedersen (s194679), Caroline Amalie Bastholm Jensen (s213427), Jaime Noguera Piera (s233773), Yassine Turki (s231735)

Data Load

library(tidyverse)  # provides read_tsv(), the dplyr verbs and ggplot2
data <- read_tsv(gzfile("../data/03_dat_aug.tsv.gz"))
Rows: 4550676 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): Smoking, gene, is_significant
dbl (5): Metastasis, gene_expression, p_value, log2_fold_change_avg, log2_fo...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
# A tibble: 4,550,676 × 8
   Metastasis Smoking gene      gene_expression p_value is_significant
        <dbl> <chr>   <chr>               <dbl>   <dbl> <chr>         
 1          0 Former  1007_s_at            6.80   0.562 No            
 2          0 Former  1053_at              8.39   0.339 No            
 3          0 Former  117_at               3.89   0.905 No            
 4          0 Former  121_at               5.43   0.809 No            
 5          0 Former  1255_g_at            2.23   0.578 No            
 6          0 Former  1294_at              2.75   0.561 No            
 7          0 Former  1316_at              2.69   0.163 No            
 8          0 Former  1320_at              2.23   0.364 No            
 9          0 Former  1405_i_at            6.71   0.903 No            
10          0 Former  1431_at              2.23   0.215 No            
# ℹ 4,550,666 more rows
# ℹ 2 more variables: log2_fold_change_avg <dbl>, log2_fold_change_sample <dbl>

Analysis 1

In the first analysis, we want to identify which genes are significantly differentially expressed in patients with metastatic cancer compared to patients with non-metastatic cancer. Furthermore, we want to investigate whether these genes are up-regulated or down-regulated.

Significance was assessed with a Student’s t-test for each gene, comparing the expression of that gene between patients with and without metastasis.
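
The per-gene test described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the code that produced `03_dat_aug.tsv.gz`: the tibble `toy`, the equal-variance setting (`var.equal = TRUE`, which is what makes it a Student's rather than a Welch t-test) and the "Yes" label are assumptions; only the column names and the "No" label are taken from the data shown above.

```r
library(dplyr)
library(tibble)

# Toy data in the same long format as `data` (values are made up)
set.seed(1)
toy <- tibble(
  gene = rep(c("gene_a", "gene_b"), each = 10),
  Metastasis = rep(rep(c(0, 1), each = 5), times = 2),
  gene_expression = c(rnorm(5, 5), rnorm(5, 8), rnorm(10, 6))
)

# One Student's t-test per gene: metastatic vs. non-metastatic samples
p_values <- toy |>
  group_by(gene) |>
  summarise(
    p_value = t.test(gene_expression[Metastasis == 1],
                     gene_expression[Metastasis == 0],
                     var.equal = TRUE)$p.value,
    .groups = "drop"
  ) |>
  # Label genes at the 0.01 level used in this analysis ("Yes" is assumed)
  mutate(is_significant = if_else(p_value <= 0.01, "Yes", "No"))

p_values
```

The result has one row per gene, so it would need to be joined back onto the long table by `gene` to give every sample row its `p_value`.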

The Log2 Fold Change for each gene was calculated from the average gene expression level in samples with metastasis and in samples without metastasis.
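
One way to compute such a fold change is sketched below on made-up toy data. The exact formula is an assumption: here it is the log2 of the ratio of the mean expression with metastasis to the mean expression without metastasis, so positive values mean higher expression in metastatic samples. The original pipeline may define the direction or the averaging differently.

```r
library(dplyr)
library(tibble)

# Toy data in the same long format as `data` (values are made up)
toy <- tibble(
  gene = rep(c("gene_a", "gene_b"), each = 4),
  Metastasis = rep(c(0, 0, 1, 1), times = 2),
  gene_expression = c(2, 2, 8, 8, 6, 6, 3, 3)
)

# Assumed definition: log2(mean with metastasis / mean without metastasis)
lfc <- toy |>
  group_by(gene) |>
  summarise(
    log2_fold_change_avg = log2(
      mean(gene_expression[Metastasis == 1]) /
        mean(gene_expression[Metastasis == 0])
    ),
    .groups = "drop"
  )

lfc
```

In this toy example `gene_a` has means 8 and 2, giving a log2 fold change of 2 (up-regulated), and `gene_b` has means 3 and 6, giving -1 (down-regulated).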

Conclusion of first analysis:

These results are shown in a volcano plot, with -log10(p-value) on the y-axis and the Log2 Fold Change on the x-axis. Each dot represents a gene, and the dotted horizontal line marks the significance level of 0.01. At this level, 286 of the 48,932 genes are significantly differentially expressed in patients with metastasis. We can also observe that the significant genes tend to show larger fold changes, in either direction, than the non-significant genes.
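
A count of significant genes like the one quoted above can be obtained by reducing the long table to one row per gene before applying the threshold. The sketch below uses made-up toy data; `toy` and `significance_counts` are hypothetical names, and the same pipeline would apply to `data` under the assumption that each gene carries a single `p_value`.

```r
library(dplyr)
library(tibble)

# Toy data: two sample rows per gene, p-value repeated on each row
toy <- tibble(
  gene = c("gene_a", "gene_a", "gene_b", "gene_b", "gene_c", "gene_c"),
  p_value = c(0.001, 0.001, 0.2, 0.2, 0.01, 0.01)
)

significance_counts <- toy |>
  distinct(gene, p_value) |>  # one row per gene
  mutate(Significance = if_else(p_value <= 0.01,
                                "Significant",
                                "Not significant")) |>
  count(Significance)

significance_counts
```

Deduplicating first matters because the table has one row per sample and gene, so counting rows directly would count samples rather than genes.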

volcano_plot <- data |> 
  # Keep one row per gene; the long table repeats these values for every sample
  select(gene, log2_fold_change_avg, p_value) |>
  unique() |>
  mutate(log_10_p = -log10(p_value),
         Significance = case_when(p_value > 0.01 ~ "Not significant",
                                  p_value <= 0.01 ~ "Significant")) |> 
  ggplot(mapping = aes(x = log2_fold_change_avg,
                       y = log_10_p,
                       color = Significance)) +
  geom_point(size = 1, alpha = 0.5) +
  # Dotted line at -log10(0.01) = 2, the significance threshold
  geom_hline(yintercept = 2,
             linetype = "dotted", 
             color = "black", 
             linewidth = 0.5) +
  # Apply the complete theme first so that it does not reset the legend setting
  theme_minimal() +
  theme(legend.position = "none") + 
  labs(title = "Genes Associated with Metastasis in Bladder Cancer", 
       subtitle = "Genes highlighted in turquoise are significant on a significance level of 0.01",
       x = "Log2 Fold Change",
       y = "-log10(p)") 
ggsave(
  filename = "../results/05_volcano_plot.png",
  plot = volcano_plot,
  device = "png",
  height = 5,
  dpi = 300,
  bg = "white"
)
Saving 7 x 5 in image
print(volcano_plot)

dev.off()
null device 
          1